Overview

This report provides an evaluation of the accuracy and precision of probabilistic forecasts submitted to the COVID-19 Forecast Hub over the last 10 weeks. The forecasts evaluated were submitted during the period from November 10, 2020 through January 18, 2021, and were scored against the reported data available as of 2021-01-26 (accounting for revisions).

In this weekly report we are evaluating forecasts made for 57 different locations (the US at the national level, 50 states, and 6 territories), at 4 horizons, over 10 submission weeks. We are evaluating incident cases and incident deaths.

Additionally, we have included a historical score report that aggregates scores from forecasts submitted to the Forecast Hub since the first week in April.

In collaboration with the US CDC, our team collects COVID-19 forecasts from dozens of teams around the globe. Each Monday evening or Tuesday morning, we combine the most recent forecasts from each team into a single “ensemble” forecast for each forecast target.

Typically on Wednesday or Thursday of each week, a summary of the week’s forecasts from the COVID-19 Forecast Hub, including the ensemble forecast, appears on the official CDC COVID-19 forecasting page.

This report was created as a collaborative effort between the COVID-19 Forecast Hub and the CMU Delphi group.

Incident Cases

Truth data and Model Submissions

Truth data

This figure shows the number of incident cases reported each week. The period between the vertical lines marks the weeks for which models were evaluated.

Number of Locations

This figure shows the number of locations for which each model submitted a weekly incident case forecast over the last 10 evaluated weeks. The maximum number of locations is 57, which includes all 50 states, a national-level forecast, and 6 US territories.

The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted on Tuesday through Friday, the listed Saturday occurs after the submission; if it is submitted on a Sunday or Monday, the Saturday occurs before the submission date.
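As a concrete illustration of this rule, the snippet below defines a hypothetical helper, reference_saturday (not part of the Hub's codebase), that maps a submission date to the Saturday plotted on the x-axis:

```r
# Hypothetical helper illustrating the week-assignment rule described above:
# submissions on Sunday or Monday map to the Saturday that has just passed,
# while submissions on Tuesday through Friday map to the upcoming Saturday.
reference_saturday <- function(forecast_date) {
  d <- as.Date(forecast_date)
  wd <- as.integer(format(d, "%u"))  # weekday: 1 = Monday, ..., 7 = Sunday
  if (wd == 7) {
    d - 1                            # Sunday: the Saturday one day earlier
  } else if (wd == 1) {
    d - 2                            # Monday: the Saturday two days earlier
  } else {
    d + (6 - wd)                     # Tuesday-Friday: the upcoming Saturday
  }
}

reference_saturday("2021-01-18")  # Monday submission  -> "2021-01-16"
reference_saturday("2021-01-19")  # Tuesday submission -> "2021-01-23"
```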

In total, 41 models submitted forecasts for incident cases: BPagano-RtDriven, CEID-Walk, COVIDhub-baseline, COVIDhub-ensemble, CU-nochange, CU-scenario_high, CU-scenario_low, CU-scenario_mid, CU-select, Columbia_UNC-SurvCon, Covid19Sim-Simulator, CovidAnalytics-DELPHI, DDS-NBDS, IEM_MED-CovidProject, IQVIA_ACOE-STAN, IowaStateLW-STEM, JCB-PRM, JHUAPL-Bucky, JHU_CSSE-DECOM, JHU_IDD-CovidSP, JHU_UNC_GAS-StatMechPool, Karlen-pypm, LANL-GrowthRate, LNQ-ens1, Microsoft-DeepSTIA, OliverWyman-Navigator, OneQuietNight-ML, QJHong-Encounter, RobertWalraven-ESG, SigSci-TS, UCF-AEM, UCLA-SuEIR, UCSB-ACTS, UChicagoCHATTOPADHYAY-UnIT, UMich-RidgeTfReg, USACE-ERDC_SEIR, USC-SI_kJalpha, USC-SI_kJalpha_RF, UVA-Ensemble, UpstateSU-GRU, Wadhwani_AI-BayesOpt. The number of models that submitted forecasts for all 10 weeks was 37. The number of teams that submitted forecasts for all 57 locations was 35.


Leaderboard Table

Each week, we generate a leaderboard table to assess the interval coverage, relative weighted interval score (WIS), and relative mean absolute error (MAE) of each model.
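For reference, the WIS used here follows the standard definition from the forecast evaluation literature (the preprint linked in the list below is authoritative). For an observed value $y$ and a forecast $F$ summarized by a predictive median $m$ and central prediction intervals $[l_k, u_k]$ with nominal levels $1 - \alpha_k$, $k = 1, \dots, K$:

$$
\mathrm{IS}_{\alpha}(F, y) = (u - l) + \frac{2}{\alpha}\,(l - y)\,\mathbf{1}\{y < l\} + \frac{2}{\alpha}\,(y - u)\,\mathbf{1}\{y > u\},
$$

$$
\mathrm{WIS}(F, y) = \frac{1}{K + 1/2}\left(\frac{1}{2}\,\lvert y - m \rvert + \sum_{k=1}^{K} \frac{\alpha_k}{2}\,\mathrm{IS}_{\alpha_k}(F, y)\right).
$$

Lower WIS is better: it generalizes the absolute error of the median by also penalizing wide intervals and intervals that miss the observation.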

To calculate each column in our table, multiple inclusion criteria were applied.

  • The first column in the table lists all models that have contributed forecasts for 5 or more weeks in total, or that have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.

  • The next column lists the number of forecasts a team has submitted with a target end date in the most recent 10-week period.

  • Columns 3 and 4 show the adjusted relative WIS and the adjusted relative MAE over the most recent 10-week period. For inclusion in these columns, a model must have forecasts with target_end_dates in 50% or more of the evaluated weeks in the most recent evaluation period.

  • Columns 5 and 6 show the calibration scores of recent models, at the 50% and 95% coverage levels respectively. Well-calibrated models should have a 50% coverage level of 0.5 and a 95% coverage level of 0.95. The inclusion criteria for these columns are the same as for the first two columns.

  • Column 7 shows the number of historical forecasts a team has submitted. All teams that have submitted at least 5 forecasts, or forecasts in 2 of the last 3 weeks, are included in this count.

  • Lastly, columns 8 and 9 show the adjusted relative WIS and adjusted relative MAE over a historical period beginning the first week in March. For inclusion in these columns, a model must have forecasts with target_end_dates in 50% or more of the evaluated weeks in the historical evaluation period.

  • All adjusted relative WIS and adjusted relative MAE values in this table are calculated using a pairwise approach to account for variation in the difficulty of forecasting different weeks and locations. Models with an adjusted relative WIS or MAE lower than 1 are more accurate than the baseline at predicting the number of incident cases, and models with an adjusted relative WIS greater than 1 are less accurate than the baseline. The code for this comparison can be found here. A preprint on this method for calculating the WIS can be found here. A sketch of the calculation appears after this list.
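To make the pairwise approach concrete, here is a minimal sketch (not the Hub's production code, which is linked above). It assumes a data frame scores with columns model, location, target_end_date, and wis: each pair of models is compared on the mean WIS over the forecast tasks both scored, each model is summarized by the geometric mean of its pairwise ratios, and the result is rescaled so the baseline equals 1.

```r
# Mean-WIS ratio between two models over the forecast tasks both scored.
pairwise_ratio <- function(scores, m1, m2) {
  shared <- merge(
    scores[scores$model == m1, c("location", "target_end_date", "wis")],
    scores[scores$model == m2, c("location", "target_end_date", "wis")],
    by = c("location", "target_end_date")
  )
  mean(shared$wis.x) / mean(shared$wis.y)
}

# Geometric mean of each model's pairwise ratios, rescaled so that the
# baseline model's score is exactly 1.
relative_wis <- function(scores, baseline = "COVIDhub-baseline") {
  models <- unique(scores$model)
  theta <- sapply(models, function(m) {
    others <- setdiff(models, m)
    exp(mean(sapply(others, function(m2) log(pairwise_ratio(scores, m, m2)))))
  })
  theta / theta[baseline]
}
```

Restricting each comparison to shared forecast tasks is what adjusts for week and location difficulty: a model is never penalized for skipping weeks that happened to be easy, or rewarded for skipping hard ones.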

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.

For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all 50 states at a 1-week horizon, for submission weeks beginning the first week in April. The second figure shows the mean WIS aggregated across locations in the same way, but at a 4-week horizon.
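As a sketch of the aggregation behind these figures (column names are illustrative, assuming a data frame scores with one WIS value per model, submission week, horizon, and location):

```r
# Mean WIS across locations, per model and submission week, for the
# 1- and 4-week horizons shown in the figures.
mean_wis <- aggregate(
  wis ~ model + forecast_week + horizon,
  data = subset(scores, horizon %in% c(1, 4)),
  FUN  = mean
)
```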

To view specific teams, double-click on the team names in the legend. To view a value on the plot, click on the point of interest. To view a specific time period, highlight that section on the graph or use the zoom functionality.

1 Week Horizon WIS

4 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error. There is often larger variation in error for the 4 week horizon compared to the 1 week horizon.

1 Week Horizon 95% Coverage

We would expect a well calibrated model to have a value of 95% in this plot.

4 Week Horizon 95% Coverage

We would expect a well calibrated model to have a value of 95% in this plot. There is typically larger variation in error for the 4 week horizon compared to the 1 week horizon.
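For reference, the empirical 95% coverage plotted above can be computed along the following lines, assuming a data frame forecasts with the 0.025 and 0.975 quantiles in columns q0.025 and q0.975 (illustrative names) and a truth table with the observed counts:

```r
# Share of observations falling inside each model's central 95% interval.
coverage_95 <- function(forecasts, truth) {
  d <- merge(forecasts, truth, by = c("location", "target_end_date"))
  hits <- d$observed >= d$q0.025 & d$observed <= d$q0.975
  tapply(hits, d$model, mean)  # empirical coverage per model; ideal is 0.95
}
```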

Incident Deaths

Truth data and Model Submissions

Truth Data

This plot shows the observed number of incident deaths over the evaluation period.

Number of Locations

In the 10-week evaluation period, the evaluated Saturdays are 2020-11-21 through 2021-01-23. The number of models that submitted forecasts for incident deaths is 51. The number of models that submitted forecasts for all 10 weeks was 43. The number of teams that submitted forecasts for all locations was 10.

The figure below shows the number of locations for which each model submitted incident death forecasts during this evaluation period. The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted on Tuesday through Friday, the listed Saturday occurs after the submission; if it is submitted on a Sunday or Monday, the Saturday occurs before the submission date.

Leaderboard Table

In order to calculate each column in our table, different inclusion criteria were applied.

  • The first column in the table lists all models that have contributed forecasts for 5 or more weeks in total, or that have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.

  • The next column lists the number of forecasts a team has submitted with a target end date in the most recent 10-week period.

  • Columns 3 and 4 show the adjusted relative WIS and the adjusted relative MAE over the most recent 10-week period. For inclusion in these columns, a model must have forecasts with target_end_dates in 50% or more of the evaluated weeks in the most recent evaluation period.

  • Columns 5 and 6 show the calibration scores of recent models, at the 50% and 95% coverage levels respectively. Well-calibrated models should have a 50% coverage level of 0.5 and a 95% coverage level of 0.95. The inclusion criteria for these columns are the same as for the first two columns.

  • Column 7 shows the number of historical forecasts a team has submitted. All teams that have submitted at least 5 forecasts, or forecasts in 2 of the last 3 weeks, are included in this count.

  • Lastly, columns 8 and 9 show the adjusted relative WIS and adjusted relative MAE over a historical period beginning the first week in March. For inclusion in these columns, a model must have forecasts with target_end_dates in 50% or more of the evaluated weeks in the historical evaluation period.

  • All adjusted relative WIS and adjusted relative MAE values in this table are calculated using a pairwise approach to account for variation in the difficulty of forecasting different weeks and locations. Models with an adjusted relative WIS or MAE lower than 1 are more accurate than the baseline at predicting the number of incident deaths, and models with an adjusted relative WIS greater than 1 are less accurate than the baseline. The code for this comparison can be found here. A preprint on this method for calculating the WIS can be found here. The calculation follows the same sketch given in the incident cases section above.

Evaluation by Week

In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at the national level for each time point.

For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon. The second figure shows the mean WIS aggregated across locations in the same way, but at a 4-week horizon.

To view specific teams, double click on the team names in the legend. To view a value on the plot, click on the point in the forecast of interest.

1 Week Horizon WIS

4 Week Horizon WIS

In this figure, the dotted black line represents the average 1 week ahead error. There is larger variation in error for the 4 week horizon compared to the 1 week horizon.

1 Week Horizon 95% Coverage

The black line represents 95% coverage.

4 Week Horizon 95% Coverage

The black line represents 95% coverage.